Abstract

Background: To examine whether lack of measurement invariance (MI) influences mean comparisons among different
disease groups, this paper provides (1) a systematic review of MI in generic constructs across chronic conditions and
(2) an empirical analysis of MI in the Health Education Impact Questionnaire (heiQ™).

Methods: (1) We searched for studies of MI among different chronic conditions in online databases. (2) Multigroup
confirmatory factor analyses were used to study MI among five chronic conditions (orthopedic condition, rheumatism,
asthma, COPD, cancer) in the heiQ™ with N = 1404 rehabilitation inpatients. Impact on latent and composite mean
differences was examined.

Results: (1) A total of 30 relevant studies suggested that about one in three items lacked MI. However, only three
studies examined impact on latent mean differences. Scale means were affected in only one of these three studies. (2)
Across the eight heiQ™ scales, seven scales had items with lack of MI in at least one disease group. However, in only
two heiQ™ scales were some latent or composite mean differences affected.

Conclusions: Lack of MI among disease groups is common and may have a relevant influence on mean comparisons
when using generic instruments. Therefore, when comparing disease groups, tests of MI should be implemented. More
studies of MI and its impact on mean differences in generic questionnaires are needed.
Background

Generic questionnaires are based on the idea that import- 
ant aspects of patients can be described across different 
chronic conditions. One such instrument, the Health Edu- 
cation Impact Questionnaire (heiQ™), aims to measure 
proximal outcomes of self-management programs across 
disease groups on eight disparate constructs, ranging from 
emotional distress to navigating the healthcare system. 
Ideally, the measurement properties of generic tools 
should be stable across disease-related characteristics, a 
property known as measurement invariance (MI) [1]. 
MI is often studied among gender, age or ethnic groups
[2,3], but little is known about MI across different
chronic conditions. This paper helps to close this gap in
the literature. The main research questions of this paper
are whether non-invariant items in generic questionnaires
across different chronic conditions are a common finding
and whether non-invariant items influence the validity of
substantive statistical analyses with these questionnaires.
First, the concept of MI and some important aspects of 
investigating MI are described. Second, a systematic review 
of studies that examined MI across different chronic 
conditions is presented. Third, the paper contains an 
empirical analysis of MI of the German version of the 
heiQ™. Results from the systematic review facilitate the in- 
terpretation of the results of the heiQ™ MI analyses. 
Measurement invariance

MI is the property of a measure being influenced sys- 
tematically only by the construct that is intended to be 
measured. That is, no other characteristic of the persons 
being measured (for example gender or disease group) or 
the assessment context should have a systematic influence 
on the measurement results [4]. Therefore, persons with 
the same level in the construct of interest are expected 
to have the same numerical values in the measure. If 
MI does not hold between two or more groups in a 
measure, estimates of mean differences between these 
groups [5], correlations with other constructs [3] or 
selection decisions based on cut-off values [6] may be 
biased. It may even be questionable whether the instru- 
ment measures the same construct among comparison 
groups [5]. Therefore, MI is regarded as a prerequisite 
for group comparisons [1,7]. 

In the literature, a range of different concepts has been 
assigned to MI, for example "item bias" or "differential item 
functioning" (DIF) [4,7,8]. Although these concepts differ in 
some nuances from MI [4,5], they are used interchangeably 
for the purposes of this article. Furthermore, different 
statistical test procedures were developed to examine 
MI, some of which are based on observable variables, 
while others are based on latent variable models such 
as item response theory (IRT) or the common factor 
model [8,9]. Most of them follow the "...'matching 
principle': systematic group differences in scores on a 
scale or item are considered as evidence of measurement 
bias only if group differences in scores remain among 
individuals who are all matched on the construct or latent 
variable being measured by the scale or item" ([9], p. 
S171). When using latent variable models, MI refers to 
invariant model parameters, e.g. factor loadings or item 
difficulties [7]. Unfortunately, different statistical methods 
can lead to different results; a "... true criterion ...[to detect 
violations of MI did not]... stand up" ([10], p. S177). 
However, three aspects should be taken into account 
when studying MI: type of parameter [11], magnitude 
and impact [12]. 

Type of parameter refers to those parameters that can 
show DIF [8]. For example, multigroup confirmatory factor 
analysis (CFA) allows separating and testing different levels 
of MI, defined by the kind of model parameters that are 
restricted to be invariant across groups. To establish 
configural invariance, merely the number of latent variables 
and assignments of indicators on these latent variables 
have to be the same in all groups. Metric invariance is 
defined by invariant factor loadings, while scalar invariance 
is defined by metric invariance plus invariant intercepts. 
Finally, strict invariance is defined by additionally invariant 
residual (co-)variances [1,11,13]. If one or more parameters
are non-invariant, partial invariance models can be
tested, in which only some parameters on each level are
restricted to be invariant [14]. At least (partial) scalar
invariance has to be established to compare means of 
latent variables, while (at least partial) strict invariance 
is needed for mean comparisons in manifest variables 
to be permissible, e.g. composite scores [15-17]. Notably, 
in IRT-models, item discrimination parameters and item 
difficulty parameters can be viewed as counterparts of 
factor loadings and intercepts in common factor models, 
respectively [7,18]. DIF in item difficulty parameters is 
sometimes labeled "uniform" bias, while DIF in item 
discrimination parameters is called "non-uniform" bias [8]. 
DIF in residual variances is not tested in IRT models, as 
IRT models imply equal residual variances [8]. 
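
The nesting of the invariance levels and the mean comparisons each level permits can be summarized in a few lines of code. The Python sketch below is an illustration only: the names `LEVELS`, `CONSTRAINTS` and `permitted_mean_comparisons` are hypothetical and do not come from any cited software, and partial invariance is deliberately not modeled.

```python
# Invariance levels in nested order; each level adds constraints to the one before.
LEVELS = ["configural", "metric", "scalar", "strict"]

# Parameters held equal across groups at each level (as described in the text).
CONSTRAINTS = {
    "configural": "same number of factors and same item-factor assignment",
    "metric": "equal factor loadings",
    "scalar": "equal factor loadings and intercepts",
    "strict": "equal factor loadings, intercepts and residual (co-)variances",
}


def permitted_mean_comparisons(level):
    """Return which mean comparisons are permissible at a given invariance level."""
    rank = LEVELS.index(level)  # raises ValueError for an unknown level name
    return {
        "latent_means": rank >= LEVELS.index("scalar"),
        "composite_means": rank >= LEVELS.index("strict"),
    }
```

For example, `permitted_mean_comparisons("scalar")` marks latent mean comparisons as permissible but composite mean comparisons as not.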

Magnitude, as defined here, refers to the size of differ- 
ences in non-invariant parameters between groups, while 
impact designates the influence of non-invariant param- 
eters on the main research questions, for example on 
mean differences in composite scores [10,19]. A researcher 
may detect a non-invariant factor loading of relevant 
magnitude (e.g., above 0.2 [20]) in one item of a scale. 
However, it is still possible that the mean group difference 
in the composite (scale) score is only marginally affected 
(small "impact"). The relationship between magnitude 
and impact is not quite clear. Some studies suggest that, 
in general, an increase in magnitude increases impact 
[3,5,21]; however, other aspects like the number of items
in a scale, direction of non-invariant parameters, size of other
model parameters or type of parameter may moderate
this relationship. For example, Steinmetz [5] found that
non-invariant intercepts may have a greater impact on
mean comparisons than non-invariant factor
loadings. Chen [3] showed that effects of multiple
non-invariant parameters on mean differences may cancel
each other out when the direction of non-invariant parameters
is mixed, i.e. some parameter values are higher in the
reference group and some are lower [10]. Although a
general conclusion regarding the relationship between 
magnitude and impact is difficult to make, studies of meas- 
urement invariance should take both features into account. 
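
The distinction between magnitude and impact can be made concrete with a small numerical sketch. It assumes a linear factor model in which the expected item score is intercept plus loading times latent mean, and an unweighted composite (item average). All numbers and the function names `composite_mean` and `composite_bias` are hypothetical and chosen for illustration only; the bias is expressed in raw score units, not as a standardized effect size.

```python
def composite_mean(intercepts, loadings, latent_mean):
    """Expected mean of the unweighted item average under a linear factor model."""
    expected_items = [tau + lam * latent_mean
                      for tau, lam in zip(intercepts, loadings)]
    return sum(expected_items) / len(expected_items)


def composite_bias(intercepts_ref, intercepts_focal, loadings, latent_mean=0.0):
    """Spurious composite difference when both groups share one latent mean."""
    return (composite_mean(intercepts_focal, loadings, latent_mean)
            - composite_mean(intercepts_ref, loadings, latent_mean))


# Hypothetical five-item scale; every number is illustrative only.
loadings = [1.0, 0.9, 0.8, 1.1, 1.0]
intercepts_ref = [2.5, 2.4, 2.6, 2.5, 2.3]

# One intercept is 0.3 higher in the focal group (a clear magnitude) ...
one_shift = [2.8, 2.4, 2.6, 2.5, 2.3]
# ... or two shifts of opposite sign, which cancel out in the composite.
mixed_shift = [2.8, 2.1, 2.6, 2.5, 2.3]

bias_one = composite_bias(intercepts_ref, one_shift, loadings)      # about 0.3 / 5
bias_mixed = composite_bias(intercepts_ref, mixed_shift, loadings)  # about 0
```

In this sketch, a single intercept difference of 0.3 shifts the five-item composite by only about 0.06 score points, and two differences of opposite direction leave the composite essentially unchanged, which mirrors the cancellation effect described above.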

In the last 20 years, many studies have been published 
to test MI in a variety of instruments in the social and 
health sciences. The majority of these studies examined 
MI in gender, age, language or culture [2]. Reviews of 
MI studies have shown that lack of MI is a common 
finding: In a review of cross-cultural MI, Chen [3] found 
that 74% of reviewed studies showed non-equal factor 
loadings in at least one item. According to Schmidt et al.
[2], half of the reviewed studies tested partial invariance
models, indicating that these studies found at least one
non-invariant parameter.

In the health sciences, Teresi et al. [22] reviewed studies 
of MI for measures of depression, quality of life and 
general health. The main question was whether MI 
could be detected in the studied constructs (across any 
comparison groups) and whether the methods used to
detect MI were appropriate. Only six of the reviewed 
studies examined MI across disease groups. Half of all 
studies did not examine all relevant types of MI. That is, 
magnitude and impact were often studied, but with differ- 
ing results: Some studies reported only minor impact, 
while others reported non-ignorable impact. The review 
was restricted to methods based on observable variables 
and IRT models; methods based on the common factor 
model were not included. 

To date, no systematic review has examined whether disease
group is associated with MI. However, MI across disease 
groups is of special interest in health science for several 
reasons: First, lack of MI might bias mean comparisons 
between different conditions in a generic construct. 
Second, lack of MI might also bias structural relation- 
ships between different constructs in different disease 
groups [3]. And finally, lack of MI might bias selection 
decisions based on cut-off values [6]. 

In the following section, a systematic review summa- 
rizes the knowledge in the scientific literature about MI 
in generic instruments across different chronic conditions. 
Then, an empirical investigation of MI among five different 
chronic conditions using the heiQ™ is presented. After- 
wards, results of both studies are discussed. 

Systematic review 

Research questions 

The systematic review tries to find out whether chronic 
condition should be regarded as a serious threat to MI in 
generic instruments. To explore this, the following main 
research questions were posed: 

1) In general, how many items (in relation to the total 
number of items in an instrument) were regarded as 
non-invariant by the identified studies? 

2) Do the identified non-invariant items have an impact 
on mean differences or other substantial statistical 
parameters? 

Furthermore, the following questions should also be 
answered by the review: 

How many studies can be identified that examined 
measurement invariance in generic instruments? Which 
constructs were examined, which chronic conditions were 
compared and which statistical methods used? What are 
the common explanations for lack of MI and what was
recommended as the best way to deal with it? Do some
aspects of the studies (e.g. examined construct, number
of comparison groups) correlate with the number of
DIF items?

In contrast to other reviews [2,3,22,23], this review was 
not restricted to special statistical methods, for example 
CFA, or to a special time period. 



Methods 

Studies were identified by searching electronic databases
(Medline via both PubMed and Ovid, PsycINFO) and by
checking reference lists in identified studies and reviews
[2,3,22,23]. The electronic search was performed on 29 August
2012. As it was expected that results would contain many
studies from areas other than health sciences (for example 
organizational research), results were filtered accordingly. 
Search and filter terms as well as inclusion and exclusion 
criteria are shown in Table 1. 

First, titles and abstracts were screened by one reviewer 
(MS). Then, full-text articles of all potentially relevant
papers were retrieved. Two independent reviewers (MS; 
GM) determined eligibility of the studies. 

The number of DIF items relative to the total number
of items per questionnaire was determined (0-100%).
Kendall's τ correlation coefficients were computed between
the number of DIF items and examined construct, number of
comparison groups, number of persons in the study, and mean
number of persons per comparison group.
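
The two quantities described above can be sketched in plain Python. The text does not state which variant of Kendall's τ was used; the sketch assumes τ-b, which corrects for ties and therefore also works for a binary study characteristic. The function names `dif_percentage` and `kendall_tau_b` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def dif_percentage(n_dif_items, n_items):
    """Share of DIF items in a questionnaire, in percent (0-100)."""
    return 100.0 * n_dif_items / n_items


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equally long sequences (assumed variant)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1          # pair tied on x (also counted if tied on both)
        if dy == 0:
            ties_y += 1          # pair tied on y (also counted if tied on both)
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denominator = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denominator == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denominator
```

A study characteristic such as "examines physical functioning" can be coded 0/1 and passed as `x`, with the DIF percentage per questionnaire as `y`.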

Results 

Study selection 

The search of electronic databases retrieved 4,017 refer- 
ences. After filtering, 2,014 studies remained and were 
evaluated on the basis of title and abstract. 91 potentially 
relevant references were identified. After examination of 
full texts, a total of 30 studies were included. Interrater
reliability in the second step was moderate (Yule's Y = 0.70),
but all disagreements could be resolved by discussion.
All relevant data of the studies are presented in Additional
file 1: Table S1, online supplement.

Constructs and instruments 

A variety of constructs were examined by the reviewed 
studies: physical functioning [24-32], depression [33-36], 
illness-related distress [37], somatization [33], mental 
health [31], pain [38], manual ability [39], daily activities 
[40-42], mobility and self-care [43], quality of life [44], 
health status [45], breathlessness severity [46], kinesiophobia
[47], dementia [48], patients' opinion about their doctor
[49], caregiver reactions [50], stigmatization [51], physicians'
empathy [52] and satisfaction [53].

Three instruments or scales (FIM, HAQ-DI, SF-36 Phys- 
ical Functioning scale) were examined in more than 
one study. 23 of the examined measures were validated 
questionnaires or scales; six studies report the development 
of a questionnaire and two studies examined an item bank. 
One study examined two measures. 

Number of patients and disease groups 

In total, 34,608 patients were examined (M = 1,154, 
Md = 538). Most studies compared two (n = 13) or three 
(n = 11) disease groups; six studies compared five or more
groups. The mean sample size per group was N = 343
(Md = 193). Generally, many different disorders were
compared, while most studies included at least one
neurological disorder.

Table 1 Search terms, filter terms and inclusion/exclusion criteria

Search terms:        "Measurement invariance", "factorial invariance", "measurement equivalence", "differential item functioning", "item bias"
Filter terms:        Chronic*, diagn*, patient*, rehab*, cancer, arthrit*, inflam*, diab*, rheum*, orthop*, respir*, asthm*, copd, health,
                     quality of life, self management, self-management, empowerment, diseas*, depress*, anxiety, trauma, injury
Inclusion criteria:  (a) empirical study of MI among different chronic conditions
                     (b) generic questionnaire
                     (c) adults
                     (d) English or German language
Exclusion criteria:  (a) only MI between factor correlations was studied, although scales were not combined to a total score;
                     (b) instruments measure disease-related constructs such as disease-specific quality of life;
                     (c) only specific subgroups of a chronic condition were studied (e.g., patients with right- vs. left-hemispheric lesions).

Note: * was used as search term.

Statistical methods 

Most studies (n = 22) used methods based on IRT, six 
studies used common factor models and two studies 
used other statistical methods. Four studies investigated 
only metric or configural invariance. Only eight studies 
examined at least scalar invariance (i.e., both uniform
and non-uniform DIF).

Number of non-invariant items, magnitude, impact and
recommendations

On average, 31% (Md = 27%, Min = 0%, Max = 85%) of the 
items showed DIF. Excluding those studies that studied 
configural or metric MI only, DIF was found in 36% of the 
items. In 25 of the examined questionnaires (81%), at least 
one item showed DIF. 16 studies reported indicators of 
magnitude, e.g. item difficulty parameters in disease 
groups. However, 15 studies reported only p-values or 
no indicators of magnitude. 

Of the 24 studies that identified at least one non- 
invariant item, only three examined impact on latent mean 
differences (none on composite mean differences). One of 
them reported statistically significant and relevant impact 
(d > 0.2, see below). However, 13 studies recommended
adjusting for DIF or being "cautious" when comparing
means between or combining data across disease groups.
Five studies examined correlations between adjusted and
non-adjusted estimates. Generally, very high correlations
(> 0.99) were reported, indicating that structural relationships
with other variables may not be affected when
ignoring DIF. None of the studies examined impact on
selection of patients according to cut-off values.

Explanations for DIF 

A total of 15 studies gave some explanations for
non-invariant items. Most of them seemed to interpret DIF
as a reflection of real clinical differences. For example, in
a study by Dallmeijer et al. [25], patients with stroke
showed higher item difficulty in the SF-36 item 'lifting/
carrying groceries' "... than patients with other multiple
sclerosis or amyotrophic lateral sclerosis, which is explained
[...] by the unilateral impairment of the arms of stroke
patients" (p. 168). In addition, some authors reported that
undetected multidimensionality [27,36,37] or misworded
items [27,41] might cause DIF, and some referred
to other studies with similar results [28,32,34,43,45].

Studies examining physical functioning in a broader sense
(e.g. including manual ability or daily activities) showed
a significantly higher number of DIF items (τ = 0.45). All other
aspects of the studies showed no correlations with the number
of DIF items (all |τ| < 0.08).

Summary 

MI was examined across a variety of chronic conditions in
many different constructs. DIF between disease groups
in at least one item of a scale appears to be common.
However, despite frequent recommendations to pay attention
to items with DIF (or to delete them), only a few
studies explicitly examined the impact of DIF on latent or
composite mean differences.

Empirical investigation of MI in the heiQ™

Research question 

The empirical investigation of MI in the heiQ™ was car- 
ried out among five chronic conditions (orthopedic con- 
ditions, rheumatism, asthma, COPD and cancer) and 
gender. Multigroup CFAs were used to test different levels 
of invariance. If non-invariant parameters were found,
their impact on latent and composite mean differences was
examined via effect size measures.

Methods 
Sample 

Patients from seven rehabilitation hospitals with a range
of medical conditions (cancer, inflammatory bowel disease,
orthopedic condition, respiratory disease, rheumatic
disease) were included. All patients completed the heiQ™ at
the beginning of inpatient rehabilitation. Some of the
patients were a subsample of patients from the study
presented in [54]. The project was approved by the ethical
review committee of Hannover Medical School (Nr. 5070). 
Participation in the study was voluntary and based on 
written informed consent. 

The Health Education Impact Questionnaire (heiQ™) 

The heiQ™ was developed in Australia and measures 
proximal outcomes of self-management programs. It 
contains 40 items (4-point response scale) across eight
independent scales: Positive and active engagement in life,
Health directed activities, Skill and technique acquisition,
Constructive attitudes and approaches, Self-monitoring
and insight, Health service navigation, Social integration
and support, and Emotional distress. The scales were
developed using CFA and item response theory [55]. In
the German version, the factorial structure was replicated 
with only minor adjustments (i.e. freeing error covariances 
between two items in five scales each) [54]. Generally, 
higher values in the heiQ™ scales indicate better status, 
except for Emotional distress, in which higher values 
indicate higher distress. The scales show appropriate 
associations with constructs like subjective health, depres- 
sion or cognitive and emotional representations of an 
illness [54]. The heiQ™ can be used to display the effects 
of self-management programs in outpatient and commu- 
nity settings [56-59] and was recently used to guide a 
Cochrane Review of self-management programs [60]. 
Further information on the heiQ™ can be found in [55,61]. 

Both in Australia and in Germany, factorial validity
was examined in about 1200 rehabilitation patients with
a variety of chronic conditions in each country. Nolte et al.
[62] examined MI over time (response-shift [63]) in the 
heiQ™. Although using a sample that included different 
chronic conditions, this study suggested remarkably stable 
psychometric properties of the heiQ™ over time. However, 
statistical models can show good fit values in heterogeneous 
samples even though subsamples may have different 
parameter values [64]. Therefore, the results of these 
studies cannot be interpreted as evidence of MI between 
chronic conditions. 

Data analysis 

To test different levels of MI, several multigroup CFAs
were computed. All analyses were done with Mplus Version
6.1 [65] using the robust maximum likelihood estimator. MI
was examined for each scale separately. The measurement
models of the German heiQ™ were used as baseline
models to test configural invariance. To identify the
models, the procedure suggested by Yoon & Millsap [66]
was used: For testing configural invariance, the factor
loading of one indicator item was set to 1 (the same item
in all groups) and the mean of the latent variable was fixed
to zero in all groups. All other parameters were free to
vary among groups. To test for metric invariance, the
variance of the latent variable in the reference group
was set to 1 and all factor loadings were fixed to be
invariant between groups (the mean of the latent variable
was still fixed to zero in all groups). Scalar invariance
was tested by additionally restricting all intercepts to be
equal between groups; the mean of the latent variable
was still fixed to zero in the reference group but was
allowed to vary across all other groups. Finally, strict
invariance was tested by restricting all residual variances
(and covariances between residual terms) to be invariant
among all comparison groups.

Configural invariance was assessed by global evaluation
of model accuracy using the χ² test as well as the model fit
indices Comparative fit index (CFI) and Root mean square
error of approximation (RMSEA). For model fit to be
interpreted as 'at least acceptable', CFI should be close
to 0.95 or above and RMSEA close to 0.06 or below
[67]. Following Saris et al. [20], metric, scalar and strict
invariance of parameters (factor loadings, intercepts,
residual variances) were evaluated by expected parameter
changes (EPC) and modification indices using the
software JruleMplus [68]. A modification index can be
regarded as a test statistic for a significance test (with 1
degree of freedom) for a misspecification (e.g., a fixed
factor loading) and an EPC offers an estimate of that
misspecification. Using the formulas provided by Saris
et al. [20], we tested whether a potential misspecification
exceeds a reference value δ. δ is determined by the
researcher and represents the size of a misspecification
regarded as relevant. In studies of MI, δ represents the
minimal difference in factor loadings, intercepts etc.
among comparison groups that is regarded as meaningful.
In other words, δs represent the
lower limits of magnitudes of non-invariant parameters
while EPCs are estimates of actual magnitudes. However,
there are no rules of thumb for choosing appropriate
critical values for equality constraints [69,70]. For example,
Steinmetz [5] found that in scales with four or six items,
differences in (unstandardized) factor loadings of 0.3 in
one or two items may have only a small impact, but differences in
intercepts of 0.075 times the scale range may have considerable
impact on latent and composite mean differences.
To be on the safe side, δ was fixed at δ = 0.15 for (unstandardized)
factor loadings and error variances and at
0.04 times the scale range of the latent variable (δ = 0.12)
for intercepts. Furthermore, the conclusions drawn from
the analysis must take the power of the modification
index test into account, which can be computed for every
combination of modification index, EPC, δ and significance
level alpha (which was fixed at alpha = 0.05 in this
study). We followed Saris et al. [20] and regarded results
based on tests with low power (<0.8) and nonsignificant
modification indices (i.e. modification indices < 3.84) as
"inconclusive", which means that it is not possible to
decide whether the misspecification exceeds δ or not, i.e.
whether the examined parameter is invariant or not. For
these parameters, impact on mean differences was not
examined (see below). For more details on the outlined
procedure, see [20,69,71]. Whenever DIF was found in
a parameter, the parameter was set free and partial
invariance models were tested. When more than one
parameter was found to be non-invariant, the parameter
with the highest EPC was set free and the new model was
tested. When JruleMplus still identified non-invariant
parameters, the procedure was repeated until no further
misspecification was indicated.
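
The judgment rule for a single constrained parameter can be sketched as follows. This is a minimal illustration, not the JruleMplus implementation. It assumes the noncentrality parameter ncp = (MI / EPC²) · δ² and a 1-df test, which is our reading of the approach in [20]; only the "inconclusive" case (nonsignificant modification index, power below 0.8) is stated explicitly in the text, and the handling of the other three cases is an assumption. The function names `mi_test_power` and `judge_parameter` are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def mi_test_power(mod_index, epc, delta, z_crit=1.959964):
    """Power of the 1-df modification index test to detect a misspecification
    of size delta (assumed formula: ncp = MI / EPC**2 * delta**2).
    Requires epc != 0."""
    ncp = (mod_index / epc ** 2) * delta ** 2
    root = sqrt(ncp)
    # P(noncentral chi-square with 1 df > z_crit**2), via the normal distribution
    return (1.0 - normal_cdf(z_crit - root)) + normal_cdf(-z_crit - root)


def judge_parameter(mod_index, epc, delta, crit=3.84, high_power=0.8):
    """Classify one equality constraint; only 'inconclusive' follows the text,
    the remaining branches are assumptions about the procedure in [20]."""
    significant = mod_index > crit
    power = mi_test_power(mod_index, epc, delta)
    if not significant and power < high_power:
        return "inconclusive"
    if not significant:
        return "invariant"          # high power, yet no significant misfit
    if power < high_power:
        return "non-invariant"      # significant despite low power
    # Significant with high power: decide on the size of the EPC itself.
    return "non-invariant" if abs(epc) >= delta else "invariant"
```

With δ = 0.15 for loadings and δ = 0.12 for intercepts, each constraint flagged by the software could be passed through `judge_parameter` before deciding which parameter to free next.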

The impact of non-invariant parameters on latent mean
differences was tested via comparison of mean group
differences between partial measurement invariance
models (PIM) and strict invariance models (SIM). PIM
were regarded as the "true" models, while SIM (wrongly)
assume that all parameters are invariant across all
groups. Standardized mean differences in latent variables
[72] between comparison groups were computed in
both SIM ($SI_{Diff}$) and PIM ($PI_{Diff}$). Then the term
$ES_{SI-PI} = SI_{Diff} - PI_{Diff}$ was computed. $ES_{SI-PI}$ represents the
size of misestimating the standardized mean difference
between two comparison groups if a SIM is chosen.
Because $SI_{Diff}$ and $PI_{Diff}$ are comparable to Cohen's d
[72], $ES_{SI-PI}$ is also a standardized value. Following
Cohen [73], absolute values of $ES_{SI-PI}$ above 0.2 are regarded
as a relevant impact of non-invariant parameters on
latent mean differences.
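
The impact measure for latent means amounts to a simple difference with a threshold. The sketch below uses hypothetical function names (`latent_misestimation`, `is_relevant`) and hypothetical input values; the standardized latent mean differences themselves would come from the fitted SIM and PIM.

```python
def latent_misestimation(si_diff, pi_diff):
    """ES_SI-PI: standardized latent mean difference under the strict model
    minus the one under the partial invariance ("true") model."""
    return si_diff - pi_diff


def is_relevant(effect_size, threshold=0.2):
    """Flag a misestimation whose absolute value exceeds the threshold."""
    return abs(effect_size) > threshold


# Hypothetical values for one pair of disease groups on one scale.
es_si_pi = latent_misestimation(si_diff=0.45, pi_diff=0.20)
flagged = is_relevant(es_si_pi)
```

Using the absolute value means that over- and underestimation of the group difference are treated alike.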

To study the impact on group differences in composite
means, we first computed standardized effect sizes (Cohen's
d) between comparison groups in composite scales in
two ways: one ($ALL_{Diff}$) by using all items of a scale
(and thus implicitly assuming strict MI), and one by
using a reduced scale with only strictly invariant items
between two comparison groups ($RED_{Diff}$). Then the terms
$ES_{PI-ALL} = PI_{Diff} - ALL_{Diff}$ and $ES_{PI-RED} = PI_{Diff} - RED_{Diff}$ were
computed. Assuming that $PI_{Diff}$ represents the "true"
difference between comparison groups, $ES_{PI-ALL}$ and
$ES_{PI-RED}$ indicate misestimation of group differences by
using $ALL_{Diff}$ or $RED_{Diff}$. Again, absolute values of $ES_{PI-ALL}$ and
$ES_{PI-RED}$ above 0.2 are regarded as relevant. Furthermore,
by comparing $ES_{PI-ALL}$ and $ES_{PI-RED}$, it was examined
whether deleting non-invariant items led to an improved
estimation of group differences.
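
The composite comparison can be sketched with the Python standard library. The sketch assumes composites are unweighted item means, that Cohen's d uses the pooled standard deviation, and that the sign convention (first group minus second group) matches the one used for the latent difference. The data layout (one dict of item scores per patient) and all function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(scores_a, scores_b):
    """Cohen's d with pooled standard deviation (group a minus group b)."""
    n_a, n_b = len(scores_a), len(scores_b)
    pooled_var = ((n_a - 1) * variance(scores_a)
                  + (n_b - 1) * variance(scores_b)) / (n_a + n_b - 2)
    return (mean(scores_a) - mean(scores_b)) / sqrt(pooled_var)


def composite_scores(responses, item_ids):
    """Per-person mean over the chosen items; responses are dicts item -> score."""
    return [mean(person[item] for item in item_ids) for person in responses]


def composite_misestimation(pi_diff, group_a, group_b, all_items, invariant_items):
    """Return ES_PI-ALL and ES_PI-RED for one pair of comparison groups."""
    all_diff = cohens_d(composite_scores(group_a, all_items),
                        composite_scores(group_b, all_items))
    red_diff = cohens_d(composite_scores(group_a, invariant_items),
                        composite_scores(group_b, invariant_items))
    return {"ES_PI_ALL": pi_diff - all_diff, "ES_PI_RED": pi_diff - red_diff}
```

If the absolute value of `ES_PI_RED` is smaller than that of `ES_PI_ALL`, deleting the non-invariant items brought the composite difference closer to the latent difference from the partial invariance model.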

Results 
Sample 

The sample comprised N = 1404 German rehabilitation
patients (42% women, mean age = 56.4 years (SD = 12.2))
with different chronic conditions. All patients with
orthopedic conditions (e.g. chronic back pain) (n = 180),
rheumatism (e.g. psoriatic arthritis, ankylosing spondylitis)
(n = 312), asthma (n = 225) and COPD (n = 118) as well as
n = 136 cancer patients were from the study presented
in [54]. The sample was supplemented by an additional
n = 433 cancer patients who also filled out the German
heiQ™ at the beginning of their inpatient rehabilitation.
Of all cancer patients, n = 215 were diagnosed with
prostate cancer, n = 217 with colon or rectum cancer and
n = 137 had another type of cancer. When analyzing MI
across gender, patients with prostate cancer were excluded.

Number, kind and magnitude of non-invariant parameters 
Gender In two scales, one item each did not show scalar 
invariance: Item 10 in Positive and active engagement in 
life (EPC = 0.12) and Item 9 in Health directed activities 
(EPC = 0.16). All other scales showed strict invariance 
across gender. 

Disease groups Table 2 shows fit indices for strict and
partial invariance models and Table 3 shows results of
invariance tests of specific parameters. One heiQ™ scale
proved to be strictly invariant between all five disease
groups (Social integration and support). Three scales
(Emotional distress, Skill and technique acquisition, Health
directed activities) showed at least scalar invariance among
four conditions. Health service navigation was strictly
invariant between patients with orthopedic conditions
and rheumatism on the one hand and patients with
asthma, COPD, and cancer on the other. Constructive
attitudes and approaches showed strict invariance in three
conditions (cancer, asthma, and orthopedic conditions).
Positive and active engagement in life showed only metric
invariance between all conditions, but at least scalar invariance among
rheumatism, cancer, and COPD. Self-monitoring and
insight showed metric invariance among patients with
orthopedic conditions and cancer on the one hand and
patients with asthma, COPD, and rheumatism on the
other hand. Scalar invariance could not be established
across any chronic condition group in this scale; however, a
partial invariance model could be established. A total of 14
items (35%) showed DIF at any analyzed parameter level in
at least one disease group. However, 2-3 items showed DIF
only in residual variances, which do not affect mean differences
between groups. Point estimates of EPCs for factor
loadings and residual variances were only slightly above the
defined values for δ; EPCs for intercepts ranged between
0.10 and 0.34.

Because of limited power, for some parameters in each
scale it could not be concluded whether they exceed δ
or not. However, point estimates of EPCs for these
parameters were mostly low (a table with all EPCs and
modification indices as well as power estimates is
available on request).
Table 2 Fit values for strict invariance models (SI) and partial invariance models (PI) among chronic conditions

| Scale | Model | Chi² (df) | p | CFI | RMSEA |
|---|---|---|---|---|---|
| Positive and active engagement in life | SI | 212.21 (99) | <0.001 | 0.868 | 0.079 |
| | PI | 121.99 (68) | <0.001 | 0.947 | 0.053 |
| Health directed activities | SI | 85.72 (49) | <0.001 | 0.975 | 0.052 |
| | PI | 69.16 (47) | 0.019 | 0.985 | 0.041 |
| Skill and technique acquisition | SI | 62.966 (50) | 0.103 | 0.986 | 0.030 |
| | PI | 45.08 (47) | 0.552 | 1.000 | 0.000 |
| Constructive attitudes and approaches | SI | 164.04 (88) | <0.001 | 0.940 | 0.063 |
| | PI | 142.26 (72) | <0.001 | 0.952 | 0.059 |
| Self-monitoring and insight | SI | 434.91 (108) | <0.001 | 0.696 | 0.104 |
| | PI | 141.97 (92) | <0.001 | 0.953 | 0.044 |
| Health service navigation | SI | 259.45 (76) | <0.001 | 0.870 | 0.095 |
| | PI | 160.20 (72) | <0.001 | 0.941 | 0.066 |
| Social integration and support | SI / PI | 155.52 (76) | <0.001 | 0.960 | 0.061 |
| Emotional distress | SI | 255.75 (108) | <0.001 | 0.944 | 0.070 |
| | PI | 208.14 (105) | <0.001 | 0.961 | 0.051 |

Notes: SI: Strict invariance model; PI: Partial invariance model (non-invariant parameters see Table 4).
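The improvement in fit from the strict to the partial invariance model can be summarized descriptively from the values in Table 2. The following is a minimal sketch, not the authors' analysis code: the dictionary `TABLE2` and the helper `fit_gain` are hypothetical names introduced here, and the raw chi-square difference is reported only descriptively, because whether it can be used as a formal difference test depends on the estimator, which is an assumption not settled by the table.

```python
# Fit values transcribed from Table 2 as (chi-square, df, CFI, RMSEA).
# Social integration and support is omitted because Table 2 lists
# only one set of fit values for it.
TABLE2 = {
    "Positive and active engagement in life": {
        "SI": (212.21, 99, 0.868, 0.079), "PI": (121.99, 68, 0.947, 0.053)},
    "Health directed activities": {
        "SI": (85.72, 49, 0.975, 0.052), "PI": (69.16, 47, 0.985, 0.041)},
    "Skill and technique acquisition": {
        "SI": (62.966, 50, 0.986, 0.030), "PI": (45.08, 47, 1.000, 0.000)},
    "Constructive attitudes and approaches": {
        "SI": (164.04, 88, 0.940, 0.063), "PI": (142.26, 72, 0.952, 0.059)},
    "Self-monitoring and insight": {
        "SI": (434.91, 108, 0.696, 0.104), "PI": (141.97, 92, 0.953, 0.044)},
    "Health service navigation": {
        "SI": (259.45, 76, 0.870, 0.095), "PI": (160.20, 72, 0.941, 0.066)},
    "Emotional distress": {
        "SI": (255.75, 108, 0.944, 0.070), "PI": (208.14, 105, 0.961, 0.051)},
}


def fit_gain(scale):
    """Change in fit when moving from the SI model to the PI model."""
    si, pi = TABLE2[scale]["SI"], TABLE2[scale]["PI"]
    return {
        "delta_chi2": round(si[0] - pi[0], 2),  # drop in chi-square
        "delta_df": si[1] - pi[1],              # parameters set free
        "delta_cfi": round(pi[2] - si[2], 3),   # gain in CFI
        "delta_rmsea": round(pi[3] - si[3], 3), # change in RMSEA
    }


if __name__ == "__main__":
    for scale in TABLE2:
        print(scale, fit_gain(scale))
```

Under these assumptions, the largest CFI gain is found for Self-monitoring and insight, which is also the scale with the most freed parameters.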



Table 3 Results of tests of MI across five chronic conditions, arranged by type of parameter

| Scale | Configural MI | Metric MI: Diag | Metric MI: Item | Metric MI: EPC | Scalar MI: Diag | Scalar MI: Item | Scalar MI: EPC | Strict MI: Diag | Strict MI: Item | Strict MI: EPC |
|---|---|---|---|---|---|---|---|---|---|---|
| Positive and active engagement in life | | | | | ortho | 2 | -0.24 | copd | 2 | 0.13 |
| | | | | | ortho | 5 | -0.18 | | | |
| | | | | | asthma | 2 | -0.17 | | | |
| | | | | | asthma | 10 | 0.12 | | | |
| Health-directed activities | ✓ | ✓ | | | copd | 19 | 0.24 | (✓) | | |
| Skill and technique acquisition | | | | | copd | 23ᵃ | -0.17 | ortho | 30 | 0.19 |
| | | | | | asthma | 23ᵃ | -0.09 | | | |
| Constructive attitudes and approaches | | copd | 36 | 0.16 | rheuma | 36 | 0.13 | (✓) | | |
| Self-monitoring and insight | | ortho | 11ᵃ | -0.14 | ortho | 3 | -0.10 | orthoᵇ | 11 | -0.13 |
| | | | | | ortho | 17 | -0.13 | | | |
| | | | | | asthma | 3ᵃ | -0.22 | | | |
| | | | | | asthma | 17ᵃ | -0.20 | | | |
| | | | | | copd | 3ᵃ | -0.31 | | | |
| | | cancer | 11ᵃ | -0.15 | copd | 6 | 0.34 | | | |
| | | | | | copd | 17ᵃ | -0.21 | | | |
| | | | | | cancer | 6 | -0.28 | | | |
| Health-service navigation | ✓ | | | | ortho | 33ᵃ | -0.21 | | | |
| | | | | | rheuma | 33ᵃ | -0.29 | | | |
| Social integration and support | ✓ | ✓ | | | ✓ | | | ✓ | | |
| Emotional distress | | | | | cancer | 12 | -0.19 | ortho | 7 | -0.16 |

Notes: MI: measurement invariance; numbers ("Item") represent non-invariant heiQ™ items in the mentioned disease group ("Diag"), followed by EPC (Expected Parameter Change) with ortho = orthopedic conditions, rheuma = rheumatism; ✓: all parameters invariant; (✓): no new DIF parameter, but parameters of items with DIF in a former stage were set free; ᵃ invariant parameter in subgroups (for example, item 3 has the same intercept in COPD and asthma); ᵇ in item 11, orthopedic group and cancer group show same factor loadings and intercept, but differ in residual variances.
Impact on latent mean differences

Gender: In both scales showing one non-invariant item
each, no relevant impact on latent or composite mean
differences was found (Positive and active engagement in
life: ES_SI-PI = 0.08, ES_PI-ALL = 0.13, ES_PI-RED = 0.06;
Health directed behavior: ES_SI-PI = 0.06, ES_PI-ALL = 0.09,
ES_PI-RED < 0.01).

Disease groups: Table 4 shows coefficients for the impact
of non-invariant items on both latent and composite mean 
differences among all five conditions for the two scales 
Positive and active engagement in life and Self-monitor- 
ing and insight. In all other scales, no relevant impact 
was found (exact values are shown in Additional file 2: 
Table S2, online-supplement). 

In Positive and active engagement in life, all comparisons
of latent means between orthopedic patients and the other
disease groups were affected in a relevant manner by
non-invariant parameters (all ES_SI-PI > 0.26). Accordingly,
when the composite scale with all items was used, differences
were also clearly misestimated (0.24 ≤ ES_PI-ALL ≤ 0.32).
Deleting the non-invariant items from the composite scale
reduced this bias (0.03 ≤ ES_PI-RED ≤ 0.17). Ignoring
non-invariant parameters did not have a relevant influence on
any other latent or composite comparison in this scale (all
|ES_SI-PI| and |ES_PI-ALL| < 0.2).

Although Self-monitoring and insight showed a complex
pattern of non-invariant parameters, ignoring them did not
lead to relevant misestimation of latent mean differences
(all |ES_SI-PI| ≤ 0.13). However, using composite scales with
all items of the scale led to a relevant misestimation of
mean differences in four comparisons (orthopedic vs. asthma,
rheumatism vs. asthma, rheumatism vs. COPD, rheumatism vs.
cancer). Again, deleting non-invariant items from the
composite scales reduced this bias (all |ES_PI-RED| ≤ 0.13).
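The flagging rule used in this section can be sketched in a few lines. This is a minimal illustration, not the authors' code: the names `THRESHOLD`, `is_relevant`, and `flagged` are hypothetical, the cut-off of 0.2 is the one stated in the notes to Table 4, and the input values are the composite-scale coefficients reported in Table 4 for orthopedic patients versus each other group in Positive and active engagement in life.

```python
THRESHOLD = 0.2  # cut-off for "relevant misestimation" (notes to Table 4)

# Orthopedic patients versus each other group, Positive and active
# engagement in life; values taken from Table 4.
ES_PI_ALL = {"rheumatism": 0.32, "asthma": 0.27, "COPD": 0.31, "cancer": 0.24}
ES_PI_RED = {"rheumatism": 0.09, "asthma": 0.17, "COPD": 0.06, "cancer": 0.03}


def is_relevant(es, threshold=THRESHOLD):
    """True if the absolute impact coefficient exceeds the cut-off."""
    return abs(es) > threshold


def flagged(es_by_group, threshold=THRESHOLD):
    """Comparison groups whose mean difference is relevantly misestimated."""
    return sorted(g for g, es in es_by_group.items() if is_relevant(es, threshold))


if __name__ == "__main__":
    print("all items:     ", flagged(ES_PI_ALL))
    print("reduced scale: ", flagged(ES_PI_RED))
```

With all items, every comparison involving orthopedic patients is flagged; with the reduced scale, none is, which mirrors the pattern described above.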

Table 4 True standardized mean differences (PI_DIFF) and impact of non-invariant items on latent (ES_SI-PI; lower triangle) and composite (ES_PI-ALL / ES_PI-RED; upper triangle) mean differences

| Disease group | Lower triangle | Ortho | Rheu | Asthma | COPD | Cancer | Upper triangle |
|---|---|---|---|---|---|---|---|
| Positive and active engagement in life | | | | | | | |
| Ortho | | | 0.32ᵃ | 0.27ᵃ | 0.31ᵃ | 0.24ᵃ | ES_PI-ALL |
| | | | 0.09 | 0.17 | 0.06 | 0.03 | ES_PI-RED |
| Rheu | PI_DIFF | 0.59 | | 0.08 | 0.02 | 0.10 | ES_PI-ALL |
| | ES_SI-PI | 0.27ᵃ | | 0.14 | 0.05 | ᵇ | ES_PI-RED |
| Asthma | PI_DIFF | 0.20 | -0.43 | | 0.06 | 0.02 | ES_PI-ALL |
| | ES_SI-PI | 0.28ᵃ | -0.02 | | 0.09 | 0.04 | ES_PI-RED |
| COPD | PI_DIFF | 0.47 | -0.13 | 0.27 | | 0.08 | ES_PI-ALL |
| | ES_SI-PI | 0.27ᵃ | -0.13 | 0.01 | | 0.12 | ES_PI-RED |
| Cancer | PI_DIFF | -0.03 | -0.64 | -0.24 | -0.52 | | |
| | ES_SI-PI | 0.28ᵃ | >-0.01 | 0.01 | >-0.01 | | |
| Self-monitoring and insight | | | | | | | |
| Ortho | | | 0.11 | 0.22ᵃ | 0.12 | 0.10 | ES_PI-ALL |
| | | | 0.02 | 0.06 | 0.02 | 0.04 | ES_PI-RED |
| Rheu | PI_DIFF | 0.24 | | 0.32ᵃ | 0.22ᵃ | 0.21ᵃ | ES_PI-ALL |
| | ES_SI-PI | 0.08 | | 0.13 | 0.08 | 0.10 | ES_PI-RED |
| Asthma | PI_DIFF | -0.34 | -0.56 | | 0.09 | 0.13 | ES_PI-ALL |
| | ES_SI-PI | -0.05 | -0.13 | | 0.03 | 0.02 | ES_PI-RED |
| COPD | PI_DIFF | -0.16 | -0.38 | 0.16 | | 0.03 | ES_PI-ALL |
| | ES_SI-PI | -0.03 | -0.10 | 0.03 | | 0.02 | ES_PI-RED |
| Cancer | PI_DIFF | -0.31 | -0.55 | 0.04 | -0.15 | | |
| | ES_SI-PI | <0.01 | -0.08 | 0.06 | 0.02 | | |

Notes: Ortho: orthopedic condition; Rheu: rheumatism; PI_DIFF: Estimations of latent mean differences in partial invariance models; ES_SI-PI: Difference in latent mean differences between strict and partial invariance models; ES_PI-ALL: Difference between latent mean differences in partial invariance models and composite mean differences using all items of a scale; ES_PI-RED: Difference between latent mean differences in partial invariance models and composite mean differences using only items with pairwise non-invariant parameters; ᵃ relevant misestimation (|ES| > 0.2); ᵇ no item with DIF between groups.

Discussion

As far as we know, this is the first review of studies on
MI in generic constructs across disease groups and the
first review on MI not restricted to a specific statistical
technique. Studies of MI among diagnostic groups have
become more prevalent in recent years; only one of
the reviewed studies was published before 2000. Disease
group appears to be increasingly recognized as an important
factor that may influence MI in a variety of generic
constructs.

At first glance, the results of both the review and the 
analyses of the heiQ™ seem to confirm the assumption 
that MI is an important aspect when applying generic 
instruments across disease groups. Over 80% of the 
examined questionnaires showed at least one item with 
non-invariant parameters; the mean proportion of non- 
invariant items was 36% (excluding studies that examined 
configural or factorial invariance only). Presumably, the 
actual number of distortions in MI may even be higher. 
First, only a few studies examined both uniform and 
non-uniform bias. Second, apart from the studies in the 
review, many studies did not examine MI directly, but 
analyzed factor structure and other parameters of a 
measure in specific conditions and compared results 
descriptively with results of other studies. These studies 
may underestimate lack of MI; hence, the number of items 
showing DIF may even be higher. Likewise, 35% of the 
heiQ™ items showed DIF in at least one disease group. 

However, items showing DIF did not always have an 
impact on the main research questions. It is difficult to 
assess whether non-invariant items of the reviewed stud- 
ies had relevant impact as only three studies [25,26,30] 
examined influences on (latent) mean differences, with 
only one showing a relevant impact [25]. Five studies 
examined impact of items with DIF on structural parame- 
ters indirectly, i.e. impact was explored via correlations 
of DIF-adjusted and non-adjusted values. Finally, none of 
the studies examined impact on either composite mean 
differences or on accuracy of selection. In contrast, we 
carried out a more detailed analysis of the heiQ™ where 
we demonstrated that seven scales included items with 
DIF. However, only a few parameters were non-invariant
in five of these scales and none of them had a relevant 
influence on latent or composite mean comparisons. 

The remaining two heiQ™ scales, however, showed several
non-invariant parameters among disease groups. Partial
invariance models among disease groups could be established,
but at least some group comparisons were affected by
non-invariant parameters.

Self-monitoring and insight: A complex pattern of
non-invariant factor loadings and intercepts among the five
disease groups indicating partial invariance was found
in this scale. This pattern may best be interpreted as a
reflection of clinical differences among disease groups.
For example, item 11 asks patients whether they know
how and when to take their medicine. However, use of
medication may have greater importance to patients in
some conditions (e.g. rheumatism or asthma) than in
others (e.g. chronic back pain). Another example is item 3,
which asks patients about their self-monitoring activities.
Asthma patients show a lower intercept (difficulty) than
both rheumatic and cancer patients in this item. Asthma
patients may well be more motivated to monitor their
health than rheumatic patients or cancer patients are,
because an immediate intervention (e.g. using an inhaler)
has a direct effect on their health status. Interestingly,
despite the complex pattern of non-invariant items, only a
small impact on latent means was detected. Still, some
composite mean comparisons were clearly affected.

Positive and active engagement in life: Patients with
orthopedic conditions (i.e. chronic back pain) showed lower
intercepts in item 5 ("I try to make the most of my life")
and item 2 ("Most days I'm doing some of the things I really
enjoy"), resulting in a relevant impact on latent and
composite mean differences. A possible explanation may be
that psychosocial factors play a larger role in chronic back
pain than in other conditions; therefore, patients may pay
more attention to stress-reducing activities. However, this
explanation is highly speculative. More research is needed
to clarify these issues.

The review showed that a higher number of non-invariant
items was found in studies that examined physical
functioning. A possible explanation might be that people
with different somatic diagnoses differ in how strongly
different areas of activity are affected. A general
hypothesis would be that the more a measured construct is
influenced by the kind of disease, the higher the
probability that indicators of the construct show DIF
between disease groups. The high number of items showing
DIF in Self-monitoring and insight would be in line with
this hypothesis.

The results also clarified that DIF should not only be 
regarded as an aspect of an item as such, but, in many 
cases, as an interaction between item and disease group. 
Many heiQ™ items showed DIF only in one of the five 
comparison groups. Similar results were presented in 
some reviewed studies. For example, many items in one 
study [43] showed DIF only between two out of three 
compared disease groups. 

Limitations 

Many statistical methods have been developed to examine
MI, but it remains unclear which method is the most
appropriate one to use. For example, the statistical
method used in the present study differs from the often
recommended CFA procedure that tests for MI by
comparing global fit values (for example chi²-difference
test or differences in CFI) [4,11,13,74]. The procedure
outlined in this study may be more sensitive in detecting
"truly" non-invariant items, because the magnitude of the
EPC and the power of modification indices are taken into
account. However, values of EPCs and modification indices
depend on the correctness of all other model parameters
[20]. If more than one parameter is non-invariant, EPCs and
modification indices may also be misleading. Furthermore,
the power for each examined parameter varied greatly, due to
different sample sizes in disease groups or different sizes
of model parameters in different heiQ™ scales. This may have
influenced the presented results. More studies that compare
different procedures for examining invariance are needed.
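How magnitude (EPC), significance (modification index), and power can be combined into one judgment per parameter is sketched below. This is an illustration of a decision scheme of the kind described in the SEM literature on detecting misspecifications, and it is an assumption that it matches the procedure used in this study in every detail. The function names, the power cut-off of 0.75, and the example inputs are hypothetical; the noncentrality parameter is approximated as (MI / EPC²) · δ², and the power of the 1-df test is computed from the normal distribution.

```python
import math

CRIT_CHI2_DF1 = 3.841458820694124  # 95th percentile of chi-square, 1 df


def _phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def mi_power(mi, epc, delta):
    """Approximate power of the 1-df modification index test to detect
    a misspecification of size delta (epc must be non-zero)."""
    ncp = (mi / epc ** 2) * delta ** 2  # noncentrality parameter
    root_c, root_ncp = math.sqrt(CRIT_CHI2_DF1), math.sqrt(ncp)
    # A chi-square(1, ncp) variable is (Z + sqrt(ncp))^2 with Z standard normal.
    return (1.0 - _phi(root_c - root_ncp)) + _phi(-root_c - root_ncp)


def judge(mi, epc, delta, high_power=0.75):
    """Classify one constrained parameter; high_power is an assumed cut-off."""
    significant = mi > CRIT_CHI2_DF1
    powerful = mi_power(mi, epc, delta) >= high_power
    if significant and not powerful:
        return "misspecified"
    if significant and powerful:
        # The test would flag even trivial deviations, so inspect the EPC.
        return "misspecified" if abs(epc) >= delta else "not misspecified"
    if powerful:
        return "not misspecified"
    return "inconclusive"  # non-significant test with low power


if __name__ == "__main__":
    # Hypothetical inputs, not values from this study.
    print(judge(mi=1.0, epc=0.05, delta=0.1))
    print(judge(mi=20.0, epc=0.30, delta=0.1))
```

The "inconclusive" branch corresponds to the situation described in the results, where limited power did not allow a conclusion about whether a parameter exceeded δ.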

As (non-)invariance is a continuum rather than a
dichotomous state [10], the results of all studies about MI
depend strongly on the choice of adequate cut-off values
for magnitude and impact, respectively. We used very
strict cut-off values in the present study, leading to a
high sensitivity for detecting potentially non-invariant
items. Choosing other cut-off values may have reduced or
increased the number of DIF items. Higher cut-off values
may also reduce the number of inconclusive comparisons.
Up to now, little guidance can be found in the literature
for selecting values for δ. Furthermore, few studies have
proposed effect size measures for estimating impact [75,76].
More empirical and simulation studies are needed to help
researchers define relevant cut-off values for both
magnitude and impact for all statistical approaches
examining MI (for another solution to these problems using
Bayes analyses, see [77]).

Furthermore, it is not known whether results of MI- 
analyses between disease groups are consistent across 
languages and cultural groups. Future work that simul- 
taneously explores cross-cultural and disease-specific MI 
issues seems warranted to generate information on the 
presence and magnitude of bias in evaluating chronic 
disease programs across countries. 

Conclusion 

Since most heiQ™ scales showed strict invariance across
gender and non-invariant items did not affect mean
differences between men and women in a relevant manner, the
heiQ™ can be used to compare men and women without
any adjustments. In six scales, comparisons of mean
differences among disease groups were also not affected
by non-invariant items, again suggesting that no adjustments
have to be made. This study showed that the heiQ™ is a
robust tool for studies within disease groups and is likely
to be an unbiased measure in controlled studies with
balanced samples across disease groups. However, in
studies with unbalanced disease groups the Self-monitoring
and insight and Positive and active engagement in life
scales should be checked for distortions of MI.
To adjust for lack of MI, we suggest comparing latent means
of partial invariance models instead of deleting
non-invariant items [5].

This study demonstrates that a lack of MI across disease
groups in generic instruments is common, possibly more
common than across socio-demographic variables such as
gender. However, its clinical impact remains unclear.
Generally, routine examination of invariance seems to be
warranted, particularly when testing hypotheses around
disease group differences and in settings where researchers
are seeking to develop generic instruments for applications
across disease groups [10]. This field will be advanced by
more systematic studies of MI across disease groups and
other clinically relevant variables. This entails simulation
studies focusing particularly on the relationship between
magnitude and clinical impact of DIF as well as qualitative
methods to elucidate sources of DIF.